import json
import requests
from warcio import ArchiveIterator
Looking at the WAT, WET and WARC Common Crawl Archives
Common Crawl provides three types of archive:
- WAT contains metadata of the crawl: the HTTP headers, elements from the <head> of the HTML (title, meta, scripts) and links from the page
- WET contains the text extracted from the HTML of the crawl, with the title on the first line
- WARC contains the entire crawl: the metadata and the full HTTP response, including the HTML
These are all in the Web Archive (WARC) format. Common Crawl has a good introduction to WARC, and there is a specification for the gory details.
Read the associated article for details, and the Jupyter Notebook.
From the spec here are the types of records:
- warcinfo - contains information about the web crawl
- metadata - record contains content created in order to further describe, explain, or accompany a harvested resource, in ways not covered by other record types.
- conversion - record shall contain an alternative version of another record’s content that was created as the result of an archival process.
- response - contains a complete response to a request, including protocol information such as the HTTP headers
- request - details of a request
- resource - record contains a resource
- revisit - describes the revisitation of content already archived, and might include only an abbreviated content body which has to be interpreted relative to a previous record.
- continuation - appended to corresponding prior record block(s) (e.g., from other WARC files) to create the logically complete full-sized original record.
Let’s take a sample WARC URL and the corresponding WET and WAT URLs. The WET and WAT files are generated from the full WARC, and their URLs can be derived from the WARC URL.
warc_url = 'https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-24/segments/1590347387219.0/warc/CC-MAIN-20200525032636-20200525062636-00381.warc.gz'
wet_url = warc_url.replace('/warc/', '/wet/').replace('warc.gz', 'warc.wet.gz')
wat_url = warc_url.replace('/warc/', '/wat/').replace('warc.gz', 'warc.wat.gz')
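The URL derivation above can be wrapped in a small helper. This is a minimal sketch that assumes the path layout seen in this one sample URL; `derived_urls` is a hypothetical name, not part of any Common Crawl tooling.

```python
def derived_urls(warc_url):
    """Return the (WET, WAT) URLs that correspond to a Common Crawl WARC URL."""
    if '/warc/' not in warc_url or not warc_url.endswith('.warc.gz'):
        raise ValueError(f'not a WARC URL: {warc_url}')
    # Drop the extension, swap the directory, then add the new extension.
    stem = warc_url[:-len('.warc.gz')]
    wet_url = stem.replace('/warc/', '/wet/') + '.warc.wet.gz'
    wat_url = stem.replace('/warc/', '/wat/') + '.warc.wat.gz'
    return wet_url, wat_url


warc_url = ('https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-24/'
            'segments/1590347387219.0/warc/'
            'CC-MAIN-20200525032636-20200525062636-00381.warc.gz')
wet_url, wat_url = derived_urls(warc_url)
print(wet_url)
print(wat_url)
```

Checking the suffix first means a URL that is not a WARC file fails loudly instead of silently producing a wrong path.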
Reading WARC
r = requests.get(warc_url, stream=True)
records = ArchiveIterator(r.raw)
The first record is a warcinfo record describing the crawl
record = next(records)
record.rec_type
'warcinfo'
a = record.content_stream().read()
print(a.decode('utf-8'))
isPartOf: CC-MAIN-2020-24
publisher: Common Crawl
description: Wide crawl of the web for May/June 2020
operator: Common Crawl Admin (info@commoncrawl.org)
hostname: ip-10-67-67-182.ec2.internal
software: Apache Nutch 1.16 (modified, https://github.com/commoncrawl/nutch/)
robots: checked via crawler-commons 1.1-SNAPSHOT (https://github.com/crawler-commons/crawler-commons)
format: WARC File Format 1.1
conformsTo: http://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/
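The warcinfo payload printed above is a series of `name: value` lines, so it is easy to turn into a dictionary. The sketch below is one way to do that with the standard library; `parse_warc_fields` is a hypothetical helper, and it deliberately ignores edge cases such as repeated names or continuation lines.

```python
def parse_warc_fields(payload):
    """Parse an application/warc-fields payload into a dict of name -> value."""
    fields = {}
    for line in payload.decode('utf-8').splitlines():
        # Split on the first colon only, so values such as URLs stay intact.
        name, sep, value = line.partition(':')
        if sep and name.strip():
            fields[name.strip()] = value.strip()
    return fields


# A shortened copy of the payload shown above.
sample = (b'isPartOf: CC-MAIN-2020-24\r\n'
          b'publisher: Common Crawl\r\n'
          b'format: WARC File Format 1.1\r\n'
          b'conformsTo: http://iipc.github.io/warc-specifications/'
          b'specifications/warc-format/warc-1.1/\r\n')
info = parse_warc_fields(sample)
print(info['isPartOf'], '|', info['format'])
```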
The next record holds details of the request to the server
record = next(records)
record.rec_type
'request'
record.rec_headers
StatusAndHeaders(protocol = 'WARC/1.0', statusline = '', headers = [('WARC-Type', 'request'), ('WARC-Date', '2020-05-25T05:11:44Z'), ('WARC-Record-ID', '<urn:uuid:b14093da-51b7-4f61-8fa5-4630084209d9>'), ('Content-Length', '330'), ('Content-Type', 'application/http; msgtype=request'), ('WARC-Warcinfo-ID', '<urn:uuid:40b0c676-a143-44c9-bde5-ad0e9999cb04>'), ('WARC-IP-Address', '124.156.125.238'), ('WARC-Target-URI', 'http://002397.cn/related_report/detail.php?id=866619')])
record.rec_headers.headers
[('WARC-Type', 'request'),
('WARC-Date', '2020-05-25T05:11:44Z'),
('WARC-Record-ID', '<urn:uuid:b14093da-51b7-4f61-8fa5-4630084209d9>'),
('Content-Length', '330'),
('Content-Type', 'application/http; msgtype=request'),
('WARC-Warcinfo-ID', '<urn:uuid:40b0c676-a143-44c9-bde5-ad0e9999cb04>'),
('WARC-IP-Address', '124.156.125.238'),
('WARC-Target-URI', 'http://002397.cn/related_report/detail.php?id=866619')]
The http_headers attribute shows the HTTP headers of the GET request
record.http_headers
StatusAndHeaders(protocol = 'GET', statusline = '/related_report/detail.php?id=866619 HTTP/1.1', headers = [('User-Agent', 'CCBot/2.0 (https://commoncrawl.org/faq/)'), ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'), ('Accept-Language', 'en-US,en;q=0.5'), ('If-Modified-Since', 'Fri, 28 Feb 2020 12:03:01 UTC'), ('Accept-Encoding', 'br,gzip'), ('Host', '002397.cn'), ('Connection', 'Keep-Alive')])
record.http_headers.headers
[('User-Agent', 'CCBot/2.0 (https://commoncrawl.org/faq/)'),
('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
('Accept-Language', 'en-US,en;q=0.5'),
('If-Modified-Since', 'Fri, 28 Feb 2020 12:03:01 UTC'),
('Accept-Encoding', 'br,gzip'),
('Host', '002397.cn'),
('Connection', 'Keep-Alive')]
There’s no data in the request
a = record.content_stream().read()
a
b''
The next record is the response to the previous request
record = next(records)
record.rec_type
'response'
record.rec_headers
StatusAndHeaders(protocol = 'WARC/1.0', statusline = '', headers = [('WARC-Type', 'response'), ('WARC-Date', '2020-05-25T05:11:44Z'), ('WARC-Record-ID', '<urn:uuid:10bc1a42-8c88-4369-a04e-7b77ca106e79>'), ('Content-Length', '14321'), ('Content-Type', 'application/http; msgtype=response'), ('WARC-Warcinfo-ID', '<urn:uuid:40b0c676-a143-44c9-bde5-ad0e9999cb04>'), ('WARC-Concurrent-To', '<urn:uuid:b14093da-51b7-4f61-8fa5-4630084209d9>'), ('WARC-IP-Address', '124.156.125.238'), ('WARC-Target-URI', 'http://002397.cn/related_report/detail.php?id=866619'), ('WARC-Payload-Digest', 'sha1:RWL3CQY47VCKFOXJVZXBQP64U7RCFODH'), ('WARC-Block-Digest', 'sha1:CNOLET4OGLWYCKDJUDAAVYF5YS3MCW4S'), ('WARC-Identified-Payload-Type', 'text/html')])
record.rec_headers.headers
[('WARC-Type', 'response'),
('WARC-Date', '2020-05-25T05:11:44Z'),
('WARC-Record-ID', '<urn:uuid:10bc1a42-8c88-4369-a04e-7b77ca106e79>'),
('Content-Length', '14321'),
('Content-Type', 'application/http; msgtype=response'),
('WARC-Warcinfo-ID', '<urn:uuid:40b0c676-a143-44c9-bde5-ad0e9999cb04>'),
('WARC-Concurrent-To', '<urn:uuid:b14093da-51b7-4f61-8fa5-4630084209d9>'),
('WARC-IP-Address', '124.156.125.238'),
('WARC-Target-URI', 'http://002397.cn/related_report/detail.php?id=866619'),
('WARC-Payload-Digest', 'sha1:RWL3CQY47VCKFOXJVZXBQP64U7RCFODH'),
('WARC-Block-Digest', 'sha1:CNOLET4OGLWYCKDJUDAAVYF5YS3MCW4S'),
('WARC-Identified-Payload-Type', 'text/html')]
record.http_headers
StatusAndHeaders(protocol = 'HTTP/1.1', statusline = '200 OK', headers = [('Date', 'Mon, 25 May 2020 05:11:44 GMT'), ('Content-Type', 'text/html'), ('X-Crawler-Content-Length', '6641'), ('Content-Length', '13911'), ('Connection', 'keep-alive'), ('Set-Cookie', 'tgw_l7_route=f60eebbcd438146c92bb28cfca9251e6; Expires=Mon, 25-May-2020 06:11:44 GMT; Path=/'), ('Server', 'Apache/2.4.23 (Unix) OpenSSL/1.0.1e-fips PHP/5.4.16'), ('X-Powered-By', 'PHP/5.4.16'), ('Vary', 'Accept-Encoding'), ('X-Crawler-Content-Encoding', 'gzip')])
record.http_headers.statusline
'200 OK'
record.http_headers.headers
[('Date', 'Mon, 25 May 2020 05:11:44 GMT'),
('Content-Type', 'text/html'),
('X-Crawler-Content-Length', '6641'),
('Content-Length', '13911'),
('Connection', 'keep-alive'),
('Set-Cookie',
'tgw_l7_route=f60eebbcd438146c92bb28cfca9251e6; Expires=Mon, 25-May-2020 06:11:44 GMT; Path=/'),
('Server', 'Apache/2.4.23 (Unix) OpenSSL/1.0.1e-fips PHP/5.4.16'),
('X-Powered-By', 'PHP/5.4.16'),
('Vary', 'Accept-Encoding'),
('X-Crawler-Content-Encoding', 'gzip')]
a = record.content_stream().read()
This contains the full HTML
print(a.decode('utf-8')[:1000])
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>纺织服装行业周报:终端零售回暖,板块业绩等待验证 - 相关研报 - 梦洁股份(002397)</title>
<meta name="mobile-agent" content="format=html5; url=detail_m.php?id=866619" />
<meta name="mobile-agent" content="format=xhtml; url=detail_m.php?id=866619" />
<meta name="keywords" content="纺织服装行业周报:终端零售回暖,板块业绩等待验证,相关研报,梦洁股份,002397"/>
<meta name="description" content="梦洁股份(002397)相关研报:纺织服装行业周报:终端零售回暖,板块业绩等待验证"/>
<link rel="stylesheet" type="text/css" href="http://txt.inv.org.cn/ir/site/pc/css.css"/>
</head>
<body>
<div class="header clearfix">
<div class="logo">
<a href="/" target="_blank"><span>梦洁股份(002397)</span></a>
</div>
<div class="header_meun">
<a href="/index_m.php" target="_blank" style="border:none;">移动版</a>
</div>
<div cl
The next record is metadata about the fetch:
- How long the fetch took
- Detected character set
- Languages detected
record = next(records)
record.rec_type
'metadata'
record.rec_headers.headers
[('WARC-Type', 'metadata'),
('WARC-Date', '2020-05-25T05:11:44Z'),
('WARC-Record-ID', '<urn:uuid:ce3946a5-f44b-417c-ab8c-3d32e7db40f7>'),
('Content-Length', '201'),
('Content-Type', 'application/warc-fields'),
('WARC-Warcinfo-ID', '<urn:uuid:40b0c676-a143-44c9-bde5-ad0e9999cb04>'),
('WARC-Concurrent-To', '<urn:uuid:10bc1a42-8c88-4369-a04e-7b77ca106e79>'),
('WARC-Target-URI', 'http://002397.cn/related_report/detail.php?id=866619')]
a = record.content_stream().read()
print(a.decode('utf-8'))
fetchTimeMs: 731
charset-detected: UTF-8
languages-cld2: {"reliable":true,"text-bytes":8659,"languages":[{"code":"zh","code-iso-639-3":"zho","text-covered":0.98,"score":2026.0,"name":"Chinese"}]}
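The `languages-cld2` value in this payload is itself JSON, so it needs a second decoding step after the `name: value` lines are split. A minimal sketch using only the standard library, with the payload copied from the output above:

```python
import json

payload = (b'fetchTimeMs: 731\r\n'
           b'charset-detected: UTF-8\r\n'
           b'languages-cld2: {"reliable":true,"text-bytes":8659,"languages":'
           b'[{"code":"zh","code-iso-639-3":"zho","text-covered":0.98,'
           b'"score":2026.0,"name":"Chinese"}]}\r\n')

fields = {}
for line in payload.decode('utf-8').splitlines():
    # Split on the first colon only; the JSON value contains more colons.
    name, sep, value = line.partition(':')
    if sep:
        fields[name.strip()] = value.strip()

fetch_ms = int(fields['fetchTimeMs'])
cld2 = json.loads(fields['languages-cld2'])
top = cld2['languages'][0]
print(fetch_ms, top['code'], top['text-covered'])
```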
Now we move on to the next request
record = next(records)
record.rec_type, record.rec_headers.get_header('WARC-Target-URI')
('request', 'http://003364.cn/j78/453618.html')
record = next(records)
record.rec_type, record.rec_headers.get_header('WARC-Target-URI')
('response', 'http://003364.cn/j78/453618.html')
record = next(records)
record.rec_type, record.rec_headers.get_header('WARC-Target-URI')
('metadata', 'http://003364.cn/j78/453618.html')
And the next record
record = next(records)
record.rec_type, record.rec_headers.get_header('WARC-Target-URI')
('request', 'http://010yingkelawyer.com/case/2018-09-25/408.html')
record = next(records)
record.rec_type, record.rec_headers.get_header('WARC-Target-URI')
('response', 'http://010yingkelawyer.com/case/2018-09-25/408.html')
record = next(records)
record.rec_type, record.rec_headers.get_header('WARC-Target-URI')
('metadata', 'http://010yingkelawyer.com/case/2018-09-25/408.html')
And the next
record = next(records)
record.rec_type, record.rec_headers.get_header('WARC-Target-URI')
('request', 'http://023yc.com/az/118080.html')
record = next(records)
record.rec_type, record.rec_headers.get_header('WARC-Target-URI')
('response', 'http://023yc.com/az/118080.html')
record = next(records)
record.rec_type, record.rec_headers.get_header('WARC-Target-URI')
('metadata', 'http://023yc.com/az/118080.html')
And so on
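Rather than calling `next` by hand, the request, response, metadata pattern can be listed with a short loop. The sketch below uses only the `rec_type` attribute and `rec_headers.get_header` call already seen above; `summarise` and `fake_record` are hypothetical names, and the stand-in records exist only so the example runs without downloading an archive.

```python
from itertools import islice
from types import SimpleNamespace


def summarise(records, limit=10):
    """Return (rec_type, WARC-Target-URI) pairs for up to `limit` records."""
    return [
        (record.rec_type, record.rec_headers.get_header('WARC-Target-URI'))
        for record in islice(records, limit)
    ]


def fake_record(rec_type, uri):
    """Build a stand-in object with the two attributes the loop relies on."""
    headers = SimpleNamespace(
        get_header=lambda name: uri if name == 'WARC-Target-URI' else None)
    return SimpleNamespace(rec_type=rec_type, rec_headers=headers)


uri = 'http://023yc.com/az/118080.html'
fakes = [fake_record(t, uri) for t in ('request', 'response', 'metadata')]
summary = summarise(iter(fakes), limit=3)
for rec_type, target in summary:
    print(rec_type, target)
```

In the notebook the same function could be passed the `records` iterator directly, with `limit` keeping the loop from walking the whole archive.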
r.close()
Reading WET
r = requests.get(wet_url, stream=True)
records = ArchiveIterator(r.raw)
The first record is information about the crawl
record = next(records)
record.rec_type
'warcinfo'
a = record.content_stream().read()
print(a.decode('utf-8'))
Software-Info: ia-web-commons.1.1.10-SNAPSHOT-20200605094634
Extracted-Date: Sun, 07 Jun 2020 16:56:24 GMT
robots: checked via crawler-commons 1.1-SNAPSHOT (https://github.com/crawler-commons/crawler-commons)
isPartOf: CC-MAIN-2020-24
operator: Common Crawl Admin (info@commoncrawl.org)
description: Wide crawl of the web for May/June 2020
publisher: Common Crawl
The WET file doesn’t contain the HTTP headers, just the title and text.
record = next(records)
record.rec_type
'conversion'
record.rec_headers.headers
[('WARC-Type', 'conversion'),
('WARC-Target-URI', 'http://002397.cn/related_report/detail.php?id=866619'),
('WARC-Date', '2020-05-25T05:11:44Z'),
('WARC-Record-ID', '<urn:uuid:3020cc7c-fd30-4f3d-bbd7-8513f33cd83a>'),
('WARC-Refers-To', '<urn:uuid:10bc1a42-8c88-4369-a04e-7b77ca106e79>'),
('WARC-Block-Digest', 'sha1:QBZTGL7G53UVVTAZ5VOKXL3C7LRZ2FUR'),
('WARC-Identified-Content-Language', 'zho'),
('Content-Type', 'text/plain'),
('Content-Length', '9300')]
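The `WARC-Refers-To` header of this conversion record carries the same ID as the `WARC-Record-ID` of the response record read from the WARC file earlier, which is how the two files can be joined. A minimal sketch of that lookup on plain dictionaries follows; `index_by_record_id` is a hypothetical helper, and the header values are copied from the outputs in this walkthrough.

```python
def index_by_record_id(header_dicts):
    """Map each WARC-Record-ID to the header dict it came from."""
    return {h['WARC-Record-ID']: h for h in header_dicts}


# With warcio, dict(record.rec_headers.headers) gives dicts of this shape,
# because the headers attribute is a list of (name, value) tuples.
warc_headers = [
    {'WARC-Type': 'request',
     'WARC-Record-ID': '<urn:uuid:b14093da-51b7-4f61-8fa5-4630084209d9>'},
    {'WARC-Type': 'response',
     'WARC-Record-ID': '<urn:uuid:10bc1a42-8c88-4369-a04e-7b77ca106e79>'},
]
wet_header = {
    'WARC-Type': 'conversion',
    'WARC-Refers-To': '<urn:uuid:10bc1a42-8c88-4369-a04e-7b77ca106e79>',
}

by_id = index_by_record_id(warc_headers)
source = by_id.get(wet_header['WARC-Refers-To'])
print(source['WARC-Type'])
```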
record.http_headers
a = record.content_stream().read()
The first line is the title of the page; everything else is the text.
print(a.decode('utf-8')[:1000])
纺织服装行业周报:终端零售回暖,板块业绩等待验证 - 相关研报 - 梦洁股份(002397)
梦洁股份(002397)
移动版
首页
股票行情
媒体报道
相关新闻
公司公告
研究报告
相关研报
纺织服装行业周报:终端零售回暖,板块业绩等待验证
发布时间:2017-02-12 研究机构:海通证券
投资要点:
市场回顾:本周(20170206-20170212)纺织服装板块上涨2.21%,跑赢上证综指0.41个百分点,在申万一级行业中列第十一。其中,纺织制造板块上涨2.58%,服装家纺板块上涨1.97%。个股方面,万里马、梦洁股份(002397)、摩登大道、美欣达、金发拉比等个股涨幅居前;探路者、星期六、希努尔、比音勒芬、山东如意跌幅靠前。从PE估值水平来看,纺织服装板块目前估值32.9倍(TTM,剔除负值),其中纺织制造板块32.0倍,服装家纺板块35.4倍。
行业数据:零售方面,春节黄金周零售大幅回升,全国百家重点大型零售企业零售额同比增长2.8%,增速相比上年回升了9.4个百分点。其中服装类商品零售额同比增长4.1%,高于上年春节10.1个百分点。2017年1月份,全国50家重点大型零售企业零售额同比增长17.8%,这一增速与同样包含了春节假期的2014年1月份增速基本持平,高于2012年同期增速4.3个百分点,消费市场显示出较强的活力。出口方面,1月份出口现开门红。2017年1月,我国纺织品出口95.84亿美元,同比增长3.50%,服装及其附件出口143.20亿美元,同比增长1.85%,纺织品服装合计出口239.04亿美元,同比增长2.5%。 周组合跑赢行业指数:跨境通(+7.34%),歌力思(+1.13%),美盛文化(+0.18%),乔治白(+2.23%),按照各1/4的权重,组合收益+2.72%。
周观点: 本周,我们对纺织服装板块2016年业绩前瞻进行了整理,共计57家公司先后发布了2016E业绩预告,其中,33家预告业绩增长,5家预告业绩持平/下滑(+5%~-10%),19家预告业绩下滑。 纺织行业率先回暖。我们认为纺织制造子版块业绩有所改善的主要原因有:1)一方面制造业出口比例较高,受益于人民币贬值带来的出口形势的改善,以及部分汇兑损益对报表带来的正面影响;2)制造业下游客户多为优质品牌商,其中海外龙头品牌由于其全球销售的性质,更广泛享受消费复苏的影响
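Since the first line of a WET payload is the page title, the two parts can be separated with a single split. A minimal sketch, assuming the payload is UTF-8 as in the records shown here; `split_wet_text` is a hypothetical helper and the sample is a shortened copy of the text above.

```python
def split_wet_text(payload):
    """Split a WET conversion payload into (title, body_text)."""
    text = payload.decode('utf-8')
    # Everything before the first newline is the title; the rest is the body.
    title, _, body = text.partition('\n')
    return title.strip(), body


sample = ('纺织服装行业周报:终端零售回暖,板块业绩等待验证 - 相关研报 - 梦洁股份(002397)\n'
          '梦洁股份(002397)\n'
          '移动版\n'
          '首页\n').encode('utf-8')
title, body = split_wet_text(sample)
print(title)
print(len(body.splitlines()), 'body lines')
```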
record = next(records)
record.rec_type
'conversion'
record.rec_headers.get_header('WARC-Target-URI')
'http://003364.cn/j78/453618.html'
a = record.content_stream().read()
This page seems to be broken PHP?
print(a.decode('utf-8')[:1000])
Can not fopen please check the file or PHP.INI
And the next page
record = next(records)
record.rec_type
'conversion'
record.rec_headers.get_header('WARC-Target-URI')
'http://010yingkelawyer.com/case/2018-09-25/408.html'
a = record.content_stream().read()
More text
print(a.decode('utf-8')[:1000])
北京刑事律师 彭坤律师辩护北某某非法吸收公众存款/集资诈骗案,成功案例
北京市盈科律师事务所
北京著名刑事辩护律师
13911269079
首页
律师简介
律师文集
业务领域
贪污贿赂
职务犯罪
经济犯罪
涉黑犯罪
海关走私
死刑复核
刑事再审
经典案例
团队风采
荣誉展示
在线留言
联系我们
您现在的位置是:首页 > 经典案例
北京刑事律师 彭坤律师辩护北某某非法吸收公众存款/集资诈骗案,成功案例
发布时间:2018-09-25 15:10:16 浏览次数:
案情简介:
北某某与庞某是夫妻关系,2010年加盟青岛某某投资管理有限公司后于2010年8月31日注册成立某某县银基信息咨询有限公司,公司,非法向社会不特定人员吸收存款344443000元,为维护平台正常运营,包装假标、过期的标,一标多融、加大自融等方式继续吸收资金,最终因客观原因,平台爆雷,截止案发未偿还贷款126060800元,公安机关以非法吸收公众存款罪、集资诈骗罪立案,案件到检后,本人多次与承办检察官沟通,据理力争,成功说服检察官仅以非法吸收公众存款罪追究我的当事人刑事责任。
案件结果:
第一被告犯非法吸收公众存款罪、集资诈骗罪,判处无期徒刑;第二被告犯非法吸收公众存款罪判处5年有期徒刑。
本案的意义:
非法集资案件,对平台负责人来讲,如果平台爆雷,大多多数案件都是以非法吸收公众存款罪、集资诈骗罪追究责任,可想而知,这类案件数额都特别巨大,一旦认定为集资诈骗,就是无期徒刑,大多数平台都存在以后面吸收的资金偿还前面本息的情况,俗称有“拆东墙补西墙”情况,关键是怎么区分非吸还是集资诈骗, “拆东墙补西墙”行为、资金的去向、标的的真假是判断的主要因素,《非法集资解释理解与适用》认为,“拆东墙补西墙”不能单独评价行为是否具有“非法占有为目的”,还应当结合其他情节综合判断,支付本息是非法集资的一个基本特征,在一定意义上,按期支付本金和高额回报反而有可能说明行为人主观上没有非法占有目的,本案,本人成功说服检察官,仅以非法吸收公众存款罪追究北某某的刑事责任。
上一篇: 北京刑事律师 彭坤办理非法利用信息网络案 成功取保候审
下一篇: 北京刑事律师 彭坤主办青岛赵某、孙某某诈骗案 二审撤销原判发回重审
首页 | 律师简介 | 法律资讯 | 业务领域 | 经典案例 | 团队风采 | 在线留言 | 联系我们
Copy
r.close()
Reading WAT
r = requests.get(wat_url, stream=True)
records = ArchiveIterator(r.raw)
Again, the first record is a warcinfo header
record = next(records)
record.rec_type
'warcinfo'
a = record.content_stream().read()
print(a.decode('utf-8'))
Software-Info: ia-web-commons.1.1.10-SNAPSHOT-20200605094634
Extracted-Date: Sun, 07 Jun 2020 16:56:24 GMT
ip: 10.67.67.60
hostname: ip-10-67-67-60.ec2.internal
format: WARC File Format 1.0
conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
The next one is metadata about the WARC records themselves
record = next(records)
record.rec_type
'metadata'
record.rec_headers.headers
[('WARC-Type', 'metadata'),
('WARC-Target-URI', 'CC-MAIN-20200525032636-20200525062636-00381.warc.gz'),
('WARC-Date', '2020-06-07T16:56:24Z'),
('WARC-Record-ID', '<urn:uuid:06070eb0-5afe-4a0c-9c6c-f0d4188414ec>'),
('WARC-Refers-To', '<urn:uuid:40b0c676-a143-44c9-bde5-ad0e9999cb04>'),
('Content-Type', 'application/json'),
('Content-Length', '1239')]
record.http_headers
a = record.content_stream().read()
data = json.loads(a.decode('utf-8'))
data
{'Container': {'Filename': 'CC-MAIN-20200525032636-20200525062636-00381.warc.gz',
'Compressed': True,
'Offset': '0',
'Gzip-Metadata': {'Deflate-Length': '481',
'Header-Length': '10',
'Footer-Length': '8',
'Inflated-CRC': '1190498035',
'Inflated-Length': '766'}},
'Envelope': {'Payload-Metadata': {'Actual-Content-Length': '503',
'Block-Digest': 'sha1:XUOM4YJTGT5VXOY2XJ5KNXDHKNMYUPQA',
'Trailing-Slop-Length': '0',
'Headers-Corrupt': True,
'Actual-Content-Type': 'application/warc-fields',
'WARC-Info-Metadata': {'isPartOf': 'CC-MAIN-2020-24',
'publisher': 'Common Crawl',
'description': 'Wide crawl of the web for May/June 2020',
'operator': 'Common Crawl Admin (info@commoncrawl.org)',
'hostname': 'ip-10-67-67-182.ec2.internal',
'software': 'Apache Nutch 1.16 (modified, https://github.com/commoncrawl/nutch/)',
'robots': 'checked via crawler-commons 1.1-SNAPSHOT (https://github.com/crawler-commons/crawler-commons)',
'format': 'WARC File Format 1.1'}},
'Format': 'WARC',
'WARC-Header-Length': '259',
'WARC-Header-Metadata': {'WARC-Type': 'warcinfo',
'WARC-Date': '2020-05-25T03:26:36Z',
'WARC-Record-ID': '<urn:uuid:40b0c676-a143-44c9-bde5-ad0e9999cb04>',
'Content-Length': '503',
'Content-Type': 'application/warc-fields',
'WARC-Filename': 'CC-MAIN-20200525032636-20200525062636-00381.warc.gz'}}}
The next record contains all the metadata of the first request
record = next(records)
record.rec_type
'metadata'
record.rec_headers.headers
[('WARC-Type', 'metadata'),
('WARC-Target-URI', 'http://002397.cn/related_report/detail.php?id=866619'),
('WARC-Date', '2020-06-07T16:56:24Z'),
('WARC-Record-ID', '<urn:uuid:56945f62-e374-4572-9e2c-f5954ed588ed>'),
('WARC-Refers-To', '<urn:uuid:b14093da-51b7-4f61-8fa5-4630084209d9>'),
('Content-Type', 'application/json'),
('Content-Length', '1458')]
record.http_headers
a = record.content_stream().read()
Container shows where the record sits in the WARC file; this record is about the request
data = json.loads(a.decode('utf-8'))
data
{'Container': {'Filename': 'CC-MAIN-20200525032636-20200525062636-00381.warc.gz',
'Compressed': True,
'Offset': '481',
'Gzip-Metadata': {'Deflate-Length': '479',
'Header-Length': '10',
'Footer-Length': '8',
'Inflated-CRC': '400586359',
'Inflated-Length': '706'}},
'Envelope': {'Payload-Metadata': {'Actual-Content-Type': 'application/http; msgtype=request',
'HTTP-Request-Metadata': {'Request-Message': {'Method': 'GET',
'Path': '/related_report/detail.php?id=866619',
'Version': 'HTTP/1.1'},
'Headers-Length': '328',
'Headers': {'User-Agent': 'CCBot/2.0 (https://commoncrawl.org/faq/)',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'If-Modified-Since': 'Fri, 28 Feb 2020 12:03:01 UTC',
'Accept-Encoding': 'br,gzip',
'Host': '002397.cn',
'Connection': 'Keep-Alive'},
'Entity-Length': '0',
'Entity-Digest': 'sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ',
'Entity-Trailing-Slop-Length': '0'},
'Actual-Content-Length': '330',
'Block-Digest': 'sha1:PVCMTRJPX7C5ZEWYAALER3IPUBR5A7S7',
'Trailing-Slop-Length': '4'},
'Format': 'WARC',
'WARC-Header-Length': '372',
'WARC-Header-Metadata': {'WARC-Type': 'request',
'WARC-Date': '2020-05-25T05:11:44Z',
'WARC-Record-ID': '<urn:uuid:b14093da-51b7-4f61-8fa5-4630084209d9>',
'Content-Length': '330',
'Content-Type': 'application/http; msgtype=request',
'WARC-Warcinfo-ID': '<urn:uuid:40b0c676-a143-44c9-bde5-ad0e9999cb04>',
'WARC-IP-Address': '124.156.125.238',
'WARC-Target-URI': 'http://002397.cn/related_report/detail.php?id=866619'}}}
Notice it’s HTTP-Request-Metadata
data['Envelope']
{'Payload-Metadata': {'Actual-Content-Type': 'application/http; msgtype=request',
'HTTP-Request-Metadata': {'Request-Message': {'Method': 'GET',
'Path': '/related_report/detail.php?id=866619',
'Version': 'HTTP/1.1'},
'Headers-Length': '328',
'Headers': {'User-Agent': 'CCBot/2.0 (https://commoncrawl.org/faq/)',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'If-Modified-Since': 'Fri, 28 Feb 2020 12:03:01 UTC',
'Accept-Encoding': 'br,gzip',
'Host': '002397.cn',
'Connection': 'Keep-Alive'},
'Entity-Length': '0',
'Entity-Digest': 'sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ',
'Entity-Trailing-Slop-Length': '0'},
'Actual-Content-Length': '330',
'Block-Digest': 'sha1:PVCMTRJPX7C5ZEWYAALER3IPUBR5A7S7',
'Trailing-Slop-Length': '4'},
'Format': 'WARC',
'WARC-Header-Length': '372',
'WARC-Header-Metadata': {'WARC-Type': 'request',
'WARC-Date': '2020-05-25T05:11:44Z',
'WARC-Record-ID': '<urn:uuid:b14093da-51b7-4f61-8fa5-4630084209d9>',
'Content-Length': '330',
'Content-Type': 'application/http; msgtype=request',
'WARC-Warcinfo-ID': '<urn:uuid:40b0c676-a143-44c9-bde5-ad0e9999cb04>',
'WARC-IP-Address': '124.156.125.238',
'WARC-Target-URI': 'http://002397.cn/related_report/detail.php?id=866619'}}
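The Container block gives an `Offset` and a `Deflate-Length` for each record. In the two records shown so far, the first offset plus its length equals the second offset (0 + 481 = 481), which suggests each record is its own gzip member and could be fetched alone with an HTTP Range request. The sketch below only builds the header; it rests on that inference from the output rather than on documented behaviour, and `range_header` is a hypothetical helper.

```python
def record_byte_range(container):
    """Return the inclusive (first, last) byte positions of a record."""
    offset = int(container['Offset'])
    length = int(container['Gzip-Metadata']['Deflate-Length'])
    return offset, offset + length - 1


def range_header(container):
    """Build an HTTP Range header covering one compressed record."""
    first, last = record_byte_range(container)
    return {'Range': f'bytes={first}-{last}'}


# Container values copied from the request record above.
container = {
    'Filename': 'CC-MAIN-20200525032636-20200525062636-00381.warc.gz',
    'Compressed': True,
    'Offset': '481',
    'Gzip-Metadata': {'Deflate-Length': '479'},
}
headers = range_header(container)
print(headers)
```

A dictionary like this could be passed as the `headers` argument of `requests.get` when requesting the WARC URL, so that only the bytes of that one record are downloaded.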
And the next one is about the response
record = next(records)
record.rec_type
'metadata'
record.rec_headers.headers
[('WARC-Type', 'metadata'),
('WARC-Target-URI', 'http://002397.cn/related_report/detail.php?id=866619'),
('WARC-Date', '2020-06-07T16:56:25Z'),
('WARC-Record-ID', '<urn:uuid:c72d34ed-8279-4653-8bd3-a53b0b072dc8>'),
('WARC-Refers-To', '<urn:uuid:10bc1a42-8c88-4369-a04e-7b77ca106e79>'),
('Content-Type', 'application/json'),
('Content-Length', '4048')]
record.http_headers
a = record.content_stream().read()
Envelope contains the details
data = json.loads(a.decode('utf-8'))
data
{'Container': {'Filename': 'CC-MAIN-20200525032636-20200525062636-00381.warc.gz',
'Compressed': True,
'Offset': '960',
'Gzip-Metadata': {'Deflate-Length': '7317',
'Header-Length': '10',
'Footer-Length': '8',
'Inflated-CRC': '-219204245',
'Inflated-Length': '14929'}},
'Envelope': {'Payload-Metadata': {'Actual-Content-Type': 'application/http; msgtype=response',
'HTTP-Response-Metadata': {'Response-Message': {'Status': '200',
'Version': 'HTTP/1.1',
'Reason': 'OK'},
'Headers-Length': '410',
'Headers': {'Date': 'Mon, 25 May 2020 05:11:44 GMT',
'Content-Type': 'text/html',
'X-Crawler-Content-Length': '6641',
'Content-Length': '13911',
'Connection': 'keep-alive',
'Set-Cookie': 'tgw_l7_route=f60eebbcd438146c92bb28cfca9251e6; Expires=Mon, 25-May-2020 06:11:44 GMT; Path=/',
'Server': 'Apache/2.4.23 (Unix) OpenSSL/1.0.1e-fips PHP/5.4.16',
'X-Powered-By': 'PHP/5.4.16',
'Vary': 'Accept-Encoding',
'X-Crawler-Content-Encoding': 'gzip'},
'HTML-Metadata': {'Head': {'Title': '纺织服装行业周报:终端零售回暖,板块业绩等待验证 - 相关研报 - 梦洁股份(002397)',
'Metas': [{'name': 'mobile-agent',
'content': 'format=html5; url=detail_m.php?id=866619'},
{'name': 'mobile-agent',
'content': 'format=xhtml; url=detail_m.php?id=866619'},
{'name': 'keywords',
'content': '纺织服装行业周报:终端零售回暖,板块业绩等待验证,相关研报,梦洁股份,002397'},
{'name': 'description',
'content': '梦洁股份(002397)相关研报:纺织服装行业周报:终端零售回暖,板块业绩等待验证'}],
'Link': [{'path': 'LINK@/href',
'url': 'http://txt.inv.org.cn/ir/site/pc/css.css',
'rel': 'stylesheet',
'type': 'text/css'}],
'Scripts': [{'path': 'SCRIPT@/src',
'url': 'http://static.bshare.cn/b/buttonLite.js#style=-1&uuid=&pophcol=2&lang=zh',
'type': 'text/javascript'},
{'path': 'SCRIPT@/src',
'url': 'http://static.bshare.cn/b/bshareC0.js',
'type': 'text/javascript'},
{'path': 'SCRIPT@/src',
'url': '//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js'},
{'path': 'SCRIPT@/src',
'url': '//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js'}]},
'Links': [{'path': 'A@/href',
'url': '/',
'target': '_blank',
'text': '梦洁股份(002397)'},
{'path': 'A@/href',
'url': '/index_m.php',
'target': '_blank',
'text': '移动版'},
{'path': 'IMG@/src',
'url': 'http://img.inv.org.cn/broker/huasheng_pc.jpg'},
{'path': 'A@/href',
'url': 'https://hd.hstong.com/marketing/2019/0228?_scnl=OTg0NWJibzY0MTI3',
'target': '_blank'},
{'path': 'A@/href', 'url': '/', 'text': '首页'},
{'path': 'A@/href', 'url': '/quote/', 'text': '股票行情'},
{'path': 'A@/href', 'url': '/media_news/', 'text': '媒体报道'},
{'path': 'A@/href', 'url': '/related_news/', 'text': '相关新闻'},
{'path': 'A@/href', 'url': '/notice/', 'text': '公司公告'},
{'path': 'A@/href', 'url': '/report/', 'text': '研究报告'},
{'path': 'A@/href', 'url': '/related_report/', 'text': '相关研报'},
{'path': 'A@/href', 'url': '/', 'target': '_blank', 'text': '梦洁股份'},
{'path': 'A@/href', 'url': '/', 'target': '_blank', 'text': '002397'},
{'path': 'A@/href',
'url': 'http://www.bShare.cn/',
'title': '分享到',
'text': '分享到'},
{'path': 'IMG@/src', 'url': 'http://img.inv.org.cn/ad/zixun_pc.jpg'},
{'path': 'A@/href',
'url': 'http://stock.inv.org.cn',
'target': '_blank',
'text': '股票投资之家'}]},
'Entity-Length': '13911',
'Entity-Digest': 'sha1:RWL3CQY47VCKFOXJVZXBQP64U7RCFODH',
'Entity-Trailing-Slop-Length': '0'},
'Actual-Content-Length': '14321',
'Block-Digest': 'sha1:CNOLET4OGLWYCKDJUDAAVYF5YS3MCW4S',
'Trailing-Slop-Length': '4'},
'Format': 'WARC',
'WARC-Header-Length': '604',
'WARC-Header-Metadata': {'WARC-Type': 'response',
'WARC-Date': '2020-05-25T05:11:44Z',
'WARC-Record-ID': '<urn:uuid:10bc1a42-8c88-4369-a04e-7b77ca106e79>',
'Content-Length': '14321',
'Content-Type': 'application/http; msgtype=response',
'WARC-Warcinfo-ID': '<urn:uuid:40b0c676-a143-44c9-bde5-ad0e9999cb04>',
'WARC-Concurrent-To': '<urn:uuid:b14093da-51b7-4f61-8fa5-4630084209d9>',
'WARC-IP-Address': '124.156.125.238',
'WARC-Target-URI': 'http://002397.cn/related_report/detail.php?id=866619',
'WARC-Payload-Digest': 'sha1:RWL3CQY47VCKFOXJVZXBQP64U7RCFODH',
'WARC-Block-Digest': 'sha1:CNOLET4OGLWYCKDJUDAAVYF5YS3MCW4S',
'WARC-Identified-Payload-Type': 'text/html'}}}
Here we’ve got the HTTP headers and response metadata
data['Envelope']['Payload-Metadata']
{'Actual-Content-Type': 'application/http; msgtype=response',
'HTTP-Response-Metadata': {'Response-Message': {'Status': '200',
'Version': 'HTTP/1.1',
'Reason': 'OK'},
'Headers-Length': '410',
'Headers': {'Date': 'Mon, 25 May 2020 05:11:44 GMT',
'Content-Type': 'text/html',
'X-Crawler-Content-Length': '6641',
'Content-Length': '13911',
'Connection': 'keep-alive',
'Set-Cookie': 'tgw_l7_route=f60eebbcd438146c92bb28cfca9251e6; Expires=Mon, 25-May-2020 06:11:44 GMT; Path=/',
'Server': 'Apache/2.4.23 (Unix) OpenSSL/1.0.1e-fips PHP/5.4.16',
'X-Powered-By': 'PHP/5.4.16',
'Vary': 'Accept-Encoding',
'X-Crawler-Content-Encoding': 'gzip'},
'HTML-Metadata': {'Head': {'Title': '纺织服装行业周报:终端零售回暖,板块业绩等待验证 - 相关研报 - 梦洁股份(002397)',
'Metas': [{'name': 'mobile-agent',
'content': 'format=html5; url=detail_m.php?id=866619'},
{'name': 'mobile-agent',
'content': 'format=xhtml; url=detail_m.php?id=866619'},
{'name': 'keywords',
'content': '纺织服装行业周报:终端零售回暖,板块业绩等待验证,相关研报,梦洁股份,002397'},
{'name': 'description',
'content': '梦洁股份(002397)相关研报:纺织服装行业周报:终端零售回暖,板块业绩等待验证'}],
'Link': [{'path': 'LINK@/href',
'url': 'http://txt.inv.org.cn/ir/site/pc/css.css',
'rel': 'stylesheet',
'type': 'text/css'}],
'Scripts': [{'path': 'SCRIPT@/src',
'url': 'http://static.bshare.cn/b/buttonLite.js#style=-1&uuid=&pophcol=2&lang=zh',
'type': 'text/javascript'},
{'path': 'SCRIPT@/src',
'url': 'http://static.bshare.cn/b/bshareC0.js',
'type': 'text/javascript'},
{'path': 'SCRIPT@/src',
'url': '//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js'},
{'path': 'SCRIPT@/src',
'url': '//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js'}]},
'Links': [{'path': 'A@/href',
'url': '/',
'target': '_blank',
'text': '梦洁股份(002397)'},
{'path': 'A@/href',
'url': '/index_m.php',
'target': '_blank',
'text': '移动版'},
{'path': 'IMG@/src',
'url': 'http://img.inv.org.cn/broker/huasheng_pc.jpg'},
{'path': 'A@/href',
'url': 'https://hd.hstong.com/marketing/2019/0228?_scnl=OTg0NWJibzY0MTI3',
'target': '_blank'},
{'path': 'A@/href', 'url': '/', 'text': '首页'},
{'path': 'A@/href', 'url': '/quote/', 'text': '股票行情'},
{'path': 'A@/href', 'url': '/media_news/', 'text': '媒体报道'},
{'path': 'A@/href', 'url': '/related_news/', 'text': '相关新闻'},
{'path': 'A@/href', 'url': '/notice/', 'text': '公司公告'},
{'path': 'A@/href', 'url': '/report/', 'text': '研究报告'},
{'path': 'A@/href', 'url': '/related_report/', 'text': '相关研报'},
{'path': 'A@/href', 'url': '/', 'target': '_blank', 'text': '梦洁股份'},
{'path': 'A@/href', 'url': '/', 'target': '_blank', 'text': '002397'},
{'path': 'A@/href',
'url': 'http://www.bShare.cn/',
'title': '分享到',
'text': '分享到'},
{'path': 'IMG@/src', 'url': 'http://img.inv.org.cn/ad/zixun_pc.jpg'},
{'path': 'A@/href',
'url': 'http://stock.inv.org.cn',
'target': '_blank',
'text': '股票投资之家'}]},
'Entity-Length': '13911',
'Entity-Digest': 'sha1:RWL3CQY47VCKFOXJVZXBQP64U7RCFODH',
'Entity-Trailing-Slop-Length': '0'},
'Actual-Content-Length': '14321',
'Block-Digest': 'sha1:CNOLET4OGLWYCKDJUDAAVYF5YS3MCW4S',
'Trailing-Slop-Length': '4'}
data['Envelope']['Payload-Metadata']['HTTP-Response-Metadata']
{'Response-Message': {'Status': '200', 'Version': 'HTTP/1.1', 'Reason': 'OK'},
'Headers-Length': '410',
'Headers': {'Date': 'Mon, 25 May 2020 05:11:44 GMT',
'Content-Type': 'text/html',
'X-Crawler-Content-Length': '6641',
'Content-Length': '13911',
'Connection': 'keep-alive',
'Set-Cookie': 'tgw_l7_route=f60eebbcd438146c92bb28cfca9251e6; Expires=Mon, 25-May-2020 06:11:44 GMT; Path=/',
'Server': 'Apache/2.4.23 (Unix) OpenSSL/1.0.1e-fips PHP/5.4.16',
'X-Powered-By': 'PHP/5.4.16',
'Vary': 'Accept-Encoding',
'X-Crawler-Content-Encoding': 'gzip'},
'HTML-Metadata': {'Head': {'Title': '纺织服装行业周报:终端零售回暖,板块业绩等待验证 - 相关研报 - 梦洁股份(002397)',
'Metas': [{'name': 'mobile-agent',
'content': 'format=html5; url=detail_m.php?id=866619'},
{'name': 'mobile-agent',
'content': 'format=xhtml; url=detail_m.php?id=866619'},
{'name': 'keywords',
'content': '纺织服装行业周报:终端零售回暖,板块业绩等待验证,相关研报,梦洁股份,002397'},
{'name': 'description',
'content': '梦洁股份(002397)相关研报:纺织服装行业周报:终端零售回暖,板块业绩等待验证'}],
'Link': [{'path': 'LINK@/href',
'url': 'http://txt.inv.org.cn/ir/site/pc/css.css',
'rel': 'stylesheet',
'type': 'text/css'}],
'Scripts': [{'path': 'SCRIPT@/src',
'url': 'http://static.bshare.cn/b/buttonLite.js#style=-1&uuid=&pophcol=2&lang=zh',
'type': 'text/javascript'},
{'path': 'SCRIPT@/src',
'url': 'http://static.bshare.cn/b/bshareC0.js',
'type': 'text/javascript'},
{'path': 'SCRIPT@/src',
'url': '//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js'},
{'path': 'SCRIPT@/src',
'url': '//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js'}]},
'Links': [{'path': 'A@/href',
'url': '/',
'target': '_blank',
'text': '梦洁股份(002397)'},
{'path': 'A@/href',
'url': '/index_m.php',
'target': '_blank',
'text': '移动版'},
{'path': 'IMG@/src', 'url': 'http://img.inv.org.cn/broker/huasheng_pc.jpg'},
{'path': 'A@/href',
'url': 'https://hd.hstong.com/marketing/2019/0228?_scnl=OTg0NWJibzY0MTI3',
'target': '_blank'},
{'path': 'A@/href', 'url': '/', 'text': '首页'},
{'path': 'A@/href', 'url': '/quote/', 'text': '股票行情'},
{'path': 'A@/href', 'url': '/media_news/', 'text': '媒体报道'},
{'path': 'A@/href', 'url': '/related_news/', 'text': '相关新闻'},
{'path': 'A@/href', 'url': '/notice/', 'text': '公司公告'},
{'path': 'A@/href', 'url': '/report/', 'text': '研究报告'},
{'path': 'A@/href', 'url': '/related_report/', 'text': '相关研报'},
{'path': 'A@/href', 'url': '/', 'target': '_blank', 'text': '梦洁股份'},
{'path': 'A@/href', 'url': '/', 'target': '_blank', 'text': '002397'},
{'path': 'A@/href',
'url': 'http://www.bShare.cn/',
'title': '分享到',
'text': '分享到'},
{'path': 'IMG@/src', 'url': 'http://img.inv.org.cn/ad/zixun_pc.jpg'},
{'path': 'A@/href',
'url': 'http://stock.inv.org.cn',
'target': '_blank',
'text': '股票投资之家'}]},
'Entity-Length': '13911',
'Entity-Digest': 'sha1:RWL3CQY47VCKFOXJVZXBQP64U7RCFODH',
'Entity-Trailing-Slop-Length': '0'}
HTML-Metadata contains the title, metas and scripts from the head, as well as the links from the body of the page.
data['Envelope']['Payload-Metadata']['HTTP-Response-Metadata']['HTML-Metadata']
{'Head': {'Title': '纺织服装行业周报:终端零售回暖,板块业绩等待验证 - 相关研报 - 梦洁股份(002397)',
'Metas': [{'name': 'mobile-agent',
'content': 'format=html5; url=detail_m.php?id=866619'},
{'name': 'mobile-agent',
'content': 'format=xhtml; url=detail_m.php?id=866619'},
{'name': 'keywords',
'content': '纺织服装行业周报:终端零售回暖,板块业绩等待验证,相关研报,梦洁股份,002397'},
{'name': 'description',
'content': '梦洁股份(002397)相关研报:纺织服装行业周报:终端零售回暖,板块业绩等待验证'}],
'Link': [{'path': 'LINK@/href',
'url': 'http://txt.inv.org.cn/ir/site/pc/css.css',
'rel': 'stylesheet',
'type': 'text/css'}],
'Scripts': [{'path': 'SCRIPT@/src',
'url': 'http://static.bshare.cn/b/buttonLite.js#style=-1&uuid=&pophcol=2&lang=zh',
'type': 'text/javascript'},
{'path': 'SCRIPT@/src',
'url': 'http://static.bshare.cn/b/bshareC0.js',
'type': 'text/javascript'},
{'path': 'SCRIPT@/src',
'url': '//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js'},
{'path': 'SCRIPT@/src',
'url': '//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js'}]},
'Links': [{'path': 'A@/href',
'url': '/',
'target': '_blank',
'text': '梦洁股份(002397)'},
{'path': 'A@/href',
'url': '/index_m.php',
'target': '_blank',
'text': '移动版'},
{'path': 'IMG@/src', 'url': 'http://img.inv.org.cn/broker/huasheng_pc.jpg'},
{'path': 'A@/href',
'url': 'https://hd.hstong.com/marketing/2019/0228?_scnl=OTg0NWJibzY0MTI3',
'target': '_blank'},
{'path': 'A@/href', 'url': '/', 'text': '首页'},
{'path': 'A@/href', 'url': '/quote/', 'text': '股票行情'},
{'path': 'A@/href', 'url': '/media_news/', 'text': '媒体报道'},
{'path': 'A@/href', 'url': '/related_news/', 'text': '相关新闻'},
{'path': 'A@/href', 'url': '/notice/', 'text': '公司公告'},
{'path': 'A@/href', 'url': '/report/', 'text': '研究报告'},
{'path': 'A@/href', 'url': '/related_report/', 'text': '相关研报'},
{'path': 'A@/href', 'url': '/', 'target': '_blank', 'text': '梦洁股份'},
{'path': 'A@/href', 'url': '/', 'target': '_blank', 'text': '002397'},
{'path': 'A@/href',
'url': 'http://www.bShare.cn/',
'title': '分享到',
'text': '分享到'},
{'path': 'IMG@/src', 'url': 'http://img.inv.org.cn/ad/zixun_pc.jpg'},
{'path': 'A@/href',
'url': 'http://stock.inv.org.cn',
'target': '_blank',
'text': '股票投资之家'}]}
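Many of the link URLs above are relative (for example `/report/`), so they need to be resolved against the page URL before they can be followed. The following is a minimal sketch of one way to do that with the standard library. The helper name `absolute_links` is hypothetical, and the base URL is assumed to be the WARC-Target-URI of this page.

```python
from urllib.parse import urljoin


def absolute_links(html_metadata, base_url):
    """Return (path, absolute_url, text) for every entry under 'Links'."""
    links = []
    for link in html_metadata.get('Links', []):
        url = link.get('url')
        if url is None:
            # Skip entries without a URL rather than guessing one
            continue
        links.append((link['path'], urljoin(base_url, url), link.get('text', '')))
    return links


# A few entries copied from the output above
sample_html_metadata = {
    'Links': [
        {'path': 'A@/href', 'url': '/index_m.php', 'target': '_blank', 'text': '移动版'},
        {'path': 'IMG@/src', 'url': 'http://img.inv.org.cn/ad/zixun_pc.jpg'},
        {'path': 'A@/href', 'url': '/report/', 'text': '研究报告'},
    ]
}

# Assumption: this is the WARC-Target-URI of the page the links came from
base_url = 'http://002397.cn/related_report/detail.php?id=866619'

resolved = absolute_links(sample_html_metadata, base_url)
for path, url, text in resolved:
    print(path, url, text)
```

Absolute URLs pass through `urljoin` unchanged, so the same helper can be applied to every entry regardless of how the page wrote the link.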
The next WAT record describes the WARC metadata record for the same URL.
record = next(records)
record.rec_type
'metadata'
record.rec_headers.headers
[('WARC-Type', 'metadata'),
('WARC-Target-URI', 'http://002397.cn/related_report/detail.php?id=866619'),
('WARC-Date', '2020-06-07T16:56:25Z'),
('WARC-Record-ID', '<urn:uuid:0802c5dc-77da-405c-851a-63b642fa2cef>'),
('WARC-Refers-To', '<urn:uuid:ce3946a5-f44b-417c-ab8c-3d32e7db40f7>'),
('Content-Type', 'application/json'),
('Content-Length', '1243')]
a = record.content_stream().read()
This envelope contains WARC-Metadata-Metadata, which holds the actual fields of the WARC metadata record: the fetch time, the detected character set and the detected languages.
data = json.loads(a)
data
{'Container': {'Filename': 'CC-MAIN-20200525032636-20200525062636-00381.warc.gz',
'Compressed': True,
'Offset': '8277',
'Gzip-Metadata': {'Deflate-Length': '434',
'Header-Length': '10',
'Footer-Length': '8',
'Inflated-CRC': '-1977350947',
'Inflated-Length': '603'}},
'Envelope': {'Payload-Metadata': {'Actual-Content-Type': 'application/metadata-fields',
'WARC-Metadata-Metadata': {'Metadata-Records': [{'Name': 'fetchTimeMs',
'Value': '731'},
{'Name': 'charset-detected', 'Value': 'UTF-8'},
{'Name': 'languages-cld2',
'Value': '{"reliable":true,"text-bytes":8659,"languages":[{"code":"zh","code-iso-639-3":"zho","text-covered":0.98,"score":2026.0,"name":"Chinese"}]}'}]},
'Actual-Content-Length': '201',
'Block-Digest': 'sha1:ZFJEHS5NUU3WCOEYR63VLFIQIYHFSN7I',
'Trailing-Slop-Length': '0'},
'Format': 'WARC',
'WARC-Header-Length': '398',
'WARC-Header-Metadata': {'WARC-Type': 'metadata',
'WARC-Date': '2020-05-25T05:11:44Z',
'WARC-Record-ID': '<urn:uuid:ce3946a5-f44b-417c-ab8c-3d32e7db40f7>',
'Content-Length': '201',
'Content-Type': 'application/warc-fields',
'WARC-Warcinfo-ID': '<urn:uuid:40b0c676-a143-44c9-bde5-ad0e9999cb04>',
'WARC-Concurrent-To': '<urn:uuid:10bc1a42-8c88-4369-a04e-7b77ca106e79>',
'WARC-Target-URI': 'http://002397.cn/related_report/detail.php?id=866619'}}}
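The Metadata-Records are stored as a list of Name/Value pairs, and the `languages-cld2` value is itself a JSON string. A small sketch of how they could be flattened into a plain dictionary follows. The helper name `metadata_fields` is hypothetical, and the sample data is copied from the output above.

```python
import json


def metadata_fields(wat_data):
    """Flatten the Name/Value pairs of a WAT metadata envelope into a dict."""
    payload = wat_data['Envelope']['Payload-Metadata']
    records = payload['WARC-Metadata-Metadata']['Metadata-Records']
    fields = {r['Name']: r['Value'] for r in records}
    # The language detection result is a JSON document stored as a string
    if 'languages-cld2' in fields:
        fields['languages-cld2'] = json.loads(fields['languages-cld2'])
    return fields


# Trimmed copy of the envelope shown above
sample = {'Envelope': {'Payload-Metadata': {'WARC-Metadata-Metadata': {'Metadata-Records': [
    {'Name': 'fetchTimeMs', 'Value': '731'},
    {'Name': 'charset-detected', 'Value': 'UTF-8'},
    {'Name': 'languages-cld2',
     'Value': '{"reliable":true,"text-bytes":8659,"languages":[{"code":"zh","code-iso-639-3":"zho","text-covered":0.98,"score":2026.0,"name":"Chinese"}]}'},
]}}}}

fields = metadata_fields(sample)
print(fields['charset-detected'])
print([lang['code'] for lang in fields['languages-cld2']['languages']])
```

Note that the other values stay as strings, exactly as they appear in the WAT file; converting `fetchTimeMs` to an integer is left to the caller.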
The same pattern repeats for the next URL: one WAT record for the request, one for the response and one for the metadata.
record = next(records)
record.rec_type, record.rec_headers.get_header('WARC-Target-URI')
('metadata', 'http://003364.cn/j78/453618.html')
data = json.loads(record.content_stream().read())
data['Envelope']['Payload-Metadata'].keys()
dict_keys(['Actual-Content-Type', 'HTTP-Request-Metadata', 'Actual-Content-Length', 'Block-Digest', 'Trailing-Slop-Length'])
record = next(records)
record.rec_type, record.rec_headers.get_header('WARC-Target-URI')
('metadata', 'http://003364.cn/j78/453618.html')
data = json.loads(record.content_stream().read())
data['Envelope']['Payload-Metadata'].keys()
dict_keys(['Actual-Content-Type', 'HTTP-Response-Metadata', 'Actual-Content-Length', 'Block-Digest', 'Trailing-Slop-Length'])
record = next(records)
record.rec_type, record.rec_headers.get_header('WARC-Target-URI')
('metadata', 'http://003364.cn/j78/453618.html')
data = json.loads(record.content_stream().read())
data['Envelope']['Payload-Metadata'].keys()
dict_keys(['Actual-Content-Type', 'WARC-Metadata-Metadata', 'Actual-Content-Length', 'Block-Digest', 'Trailing-Slop-Length'])
The URL after that follows the same sequence of request, response and metadata.
record = next(records)
record.rec_type, record.rec_headers.get_header('WARC-Target-URI')
('metadata', 'http://010yingkelawyer.com/case/2018-09-25/408.html')
data = json.loads(record.content_stream().read())
data['Envelope']['Payload-Metadata'].keys()
dict_keys(['Actual-Content-Type', 'HTTP-Request-Metadata', 'Actual-Content-Length', 'Block-Digest', 'Trailing-Slop-Length'])
record = next(records)
record.rec_type, record.rec_headers.get_header('WARC-Target-URI')
('metadata', 'http://010yingkelawyer.com/case/2018-09-25/408.html')
data = json.loads(record.content_stream().read())
data['Envelope']['Payload-Metadata'].keys()
dict_keys(['Actual-Content-Type', 'HTTP-Response-Metadata', 'Actual-Content-Length', 'Block-Digest', 'Trailing-Slop-Length'])
record = next(records)
record.rec_type, record.rec_headers.get_header('WARC-Target-URI')
('metadata', 'http://010yingkelawyer.com/case/2018-09-25/408.html')
data = json.loads(record.content_stream().read())
data['Envelope']['Payload-Metadata'].keys()
dict_keys(['Actual-Content-Type', 'WARC-Metadata-Metadata', 'Actual-Content-Length', 'Block-Digest', 'Trailing-Slop-Length'])
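Every WAT record has the WARC type `metadata`, so the only way to tell which kind of crawl record it describes is the key present under Payload-Metadata. As a rough sketch, the records seen above could be classified and grouped by URL as follows. The names `KEY_TO_KIND`, `wat_record_kind` and `kinds_by_uri` are hypothetical, and the mapping covers only the three keys observed in this file.

```python
from collections import defaultdict

# Assumption: only the three payload keys observed above are mapped;
# anything else (such as the record for the warcinfo) is labelled 'other'.
KEY_TO_KIND = {
    'HTTP-Request-Metadata': 'request',
    'HTTP-Response-Metadata': 'response',
    'WARC-Metadata-Metadata': 'metadata',
}


def wat_record_kind(payload_keys):
    """Name the kind of crawl record a WAT record describes."""
    for key, kind in KEY_TO_KIND.items():
        if key in payload_keys:
            return kind
    return 'other'


def kinds_by_uri(observations):
    """Group record kinds by target URI, keeping the order they were read in."""
    grouped = defaultdict(list)
    for uri, payload_keys in observations:
        grouped[uri].append(wat_record_kind(payload_keys))
    return dict(grouped)


# The (target URI, payload keys) pairs printed above
common = ['Actual-Content-Type', 'Actual-Content-Length', 'Block-Digest', 'Trailing-Slop-Length']
observed = [
    ('http://003364.cn/j78/453618.html', common + ['HTTP-Request-Metadata']),
    ('http://003364.cn/j78/453618.html', common + ['HTTP-Response-Metadata']),
    ('http://003364.cn/j78/453618.html', common + ['WARC-Metadata-Metadata']),
    ('http://010yingkelawyer.com/case/2018-09-25/408.html', common + ['HTTP-Request-Metadata']),
    ('http://010yingkelawyer.com/case/2018-09-25/408.html', common + ['HTTP-Response-Metadata']),
    ('http://010yingkelawyer.com/case/2018-09-25/408.html', common + ['WARC-Metadata-Metadata']),
]

summary = kinds_by_uri(observed)
for uri, kinds in summary.items():
    print(uri, kinds)
```

With a live iterator, each pair would come from `record.rec_headers.get_header('WARC-Target-URI')` and `data['Envelope']['Payload-Metadata'].keys()`, as in the cells above.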
r.close()